Abstract: We present a comparative study of single imputation techniques (mean, median, and standard deviation
substitution) combined with the k-NN algorithm. Using a training set with corresponding class labels, k-NN groups the
data into groups of different sizes. Each technique is applied within each group and the results are compared. Median
and standard deviation substitution show better results than mean substitution.
I. INTRODUCTION

Missing data is one of the problems that must be solved in real-time applications. Both traditional and modern
methods exist for handling it. Values may be Missing Completely at Random, Missing at Random, or Missing Not at
Random, and each variable should be treated separately. In this study, the k-NN algorithm is used to partition the data
set into groups. The training examples are vectors in a multidimensional feature space, each with a class label. The training
phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the
classification phase, k is a user-defined constant, and an unlabeled vector is classified by assigning the label that is most
frequent among the k training samples nearest to that query point. Euclidean distance is usually used as the distance metric.
After the data are grouped, the missing data in each group are imputed by mean, median, or standard deviation
substitution. The results are compared in terms of percentage accuracy.

II. LITERATURE SURVEY 

Rubin and Little have defined three classes of missing-data mechanisms: missing completely at random (MCAR), missing
at random (MAR), and not missing at random (NMAR). In statistical modelling, when external knowledge is available about
the dependencies among the missing values and between the missing and observed data, one might use Markov or Gibbs
processes for imputation. Likelihood-based tests have been proposed by Fuchs (1982) for contingency tables and by
Little (1988) for multivariate normal data. A nonparametric test has been proposed by Diggle (1989) for preliminary
screening. Rideout and Diggle (1991) have proposed a parametric test which requires the modelling of the missing-data
mechanism. Chen and Little (1999) have generalized Little's (1988) basic idea of constructing test statistics; they avoid
distributional assumptions, whereas Little (1988) assumed normal data. Some specific tests for linear regression models
have also been proposed. Simon and Simonoff (1986) describe tools for MAR diagnostics and for other purposes, making
no assumptions about the nature of the missing-value process. Simonoff (1998) has introduced a test to detect non-MCAR
mechanisms; his diagnostics are based on standard outlier and leverage-point regression diagnostics. More recently,
Toutenburg and Fieger (2001) introduced methods to analyse and detect non-MCAR processes for missing covariates.
They use outlier detection to identify non-MCAR cases.

III. SINGLE IMPUTATION TECHNIQUES 

A. Mean Substitution 

Mean substitution is the most commonly practiced single imputation technique. It replaces the missing values of a
variable with the mean of that variable's observed values. The imputed values therefore depend on one and only one
quantity: the between-subjects mean for that variable, based on the available data. Mean substitution preserves the mean
of a variable's distribution; however, it typically distorts other characteristics of the distribution, such as its
variance.
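
The effect described above can be checked on a small example. The following is a minimal sketch rather than the
implementation used in this study; the function name mean_substitute and the sample column are hypothetical, and missing
entries are assumed to be encoded as None.

```python
from statistics import mean, pvariance


def mean_substitute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries (illustrative data only).
column = [2.0, 4.0, None, 6.0, None, 8.0]
observed = [v for v in column if v is not None]
imputed = mean_substitute(column)

print(imputed)                                  # [2.0, 4.0, 5.0, 6.0, 5.0, 8.0]
print(mean(observed), mean(imputed))            # the mean is preserved
print(pvariance(observed), pvariance(imputed))  # the variance shrinks after imputation
```

The mean is unchanged, while the variance of the imputed column is smaller than that of the observed values, which is
the distortion noted above.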

B. Median Substitution 

Mean or median substitution of covariates and outcome variables is still frequently used. The method is slightly
improved by first stratifying the data into subgroups and using the subgroup average. Median imputation leaves the
median of the entire data set the same as it would be under case deletion, but it decreases the variability between
individuals' responses, biasing variances and covariances toward zero.

C. Standard Deviation 

The standard deviation measures the spread of the data about the mean value. It is useful for comparing sets of data
that have the same mean but a different range. It is the square root of the average squared deviation of the values
from their mean.
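
In symbols, for N observed values x_1, ..., x_N of a variable, and assuming the population form (the sample form
divides by N - 1 instead of N), the standard deviation can be written as:

```latex
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2},
\qquad
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
```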

Suppose each sample in our data set has n attributes, which we combine to form an n-dimensional vector
x = (x1, x2, ..., xn). These n attributes are considered to be the independent variables. Each sample also has another
attribute, denoted by y (the dependent variable), whose value depends on the other n attributes x. We assume that y is a
categorical variable and that there is a scalar function f which assigns a class y = f(x) to every such vector. We suppose
that a set of T such vectors is given together with their corresponding classes: x(i), y(i) for i = 1, 2, ..., T. This set
is referred to as the training set.

The idea in k-nearest neighbor methods is to identify the k samples in the training set whose independent variables x
are most similar to those of a new sample u, and to use these k samples to assign the new sample to a class v. If f is a
smooth function, a reasonable idea is to look for samples in the training data that are near u (in terms of the
independent variables) and then to compute v from the values of y for these samples. The distance, or dissimilarity,
between samples is measured here using Euclidean distance. The Euclidean distance between two points is the square root
of the sum of the squared differences of their coordinates.
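
In symbols, for a training sample x = (x1, ..., xn) and a new sample u = (u1, ..., un), where the component notation
u_j is an assumption made here for illustration, the standard definition is:

```latex
d(x, u) = \sqrt{\sum_{j=1}^{n} \left( x_j - u_j \right)^2}
```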

The simplest case is k = 1, where we find the sample in the training set that is closest to u (its nearest neighbor)
and set v = y, where y is the class of that nearest neighbor. For general k, we find the k nearest neighbors of u and then
use a majority decision rule to classify the new sample. The advantage is that higher values of k provide smoothing that
reduces the risk of over-fitting due to noise in the training data. In typical applications k is in units or tens rather
than in hundreds or thousands. Notice that if k = T, the number of samples in the training set, we are merely predicting
the class that has the majority in the training data for all samples, irrespective of u.
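
The classification rule described above can be sketched in a few lines. This is a minimal illustration, not the code
used in this study; the function names euclidean and knn_classify, the tie-breaking behaviour, and the toy training set
are assumptions introduced for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(train_x, train_y, u, k):
    """Return the majority class among the k training samples nearest to u."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training indices by distance to u and keep the k nearest.
    nearest = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], u))[:k]
    votes = Counter(train_y[i] for i in nearest)
    # Ties go to the class that was encountered first, i.e. the nearer sample.
    return votes.most_common(1)[0][0]


# Hypothetical training set with T = 5 samples and two classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = ["A", "A", "A", "B", "B"]
u = (4.8, 5.1)

print(knn_classify(train_x, train_y, u, k=1))  # B
print(knn_classify(train_x, train_y, u, k=3))  # B
print(knn_classify(train_x, train_y, u, k=5))  # A
```

With k equal to the size of the training set, the prediction is the overall majority class "A" even though u lies next
to the "B" samples, which illustrates the limiting case noted above.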



IV. EXPERIMENTAL ANALYSIS AND RESULT 

A data set consisting of 5000 records with 5 variables was taken for analysis. The test data set was prepared with
some values missing, at missing rates of 2, 5, 10, 15 and 20 percent. The k-NN algorithm was run with different training
data and their corresponding classes, and the group size was varied over 3, 6 and 9. Missing values in each group were
then imputed separately by mean, median and standard deviation substitution. The results are shown in the table below.
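
How the missing values were introduced is not described here. One possible way to prepare such a test set is sketched
below, assuming that the given percentage of individual cells is removed completely at random; the function name
mask_values, the seed, and the synthetic records are hypothetical.

```python
import random


def mask_values(records, rate, seed=0):
    """Return a copy of records with a fraction `rate` of cells set to None."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(records)) for c in range(len(records[r]))]
    n_missing = round(rate * len(cells))
    masked = [list(row) for row in records]
    # Pick distinct cells uniformly at random (an MCAR assumption).
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = None
    return masked


# Synthetic stand-in data: 100 records with 5 variables (not the study's data set).
records = [[float(r * 5 + c) for c in range(5)] for r in range(100)]

for pct in (2, 5, 10, 15, 20):
    masked = mask_values(records, pct / 100)
    n_none = sum(v is None for row in masked for v in row)
    print(pct, n_none)
```

Fixing the seed keeps the masked cells identical across runs, so the imputation methods can be compared on the same
pattern of missing values.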
15    64    72    72
60    72    72
78    77    50    80
The table below shows the averages of the results in the table above. Median and standard deviation substitution
show some improvement over mean substitution. There is also a gradual improvement in the percentage of accuracy
across the different group sizes.





Table 2: Average of the above Method

Percentage of Missing    Mean Substitution    Median Substitution
76    76
15    65    76
55
76
V. CONCLUSION AND FUTURE WORK

The k-NN algorithm is one of the best-known classifiers for grouping data. Traditional methods such as mean,
median and standard deviation substitution are used to improve the accuracy of missing data imputation. This work can
be further extended by comparing these methods with other machine learning techniques such as SOM and MLP.